
    Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

    In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV. Comment: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version.
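    The softmax-based revision of query-video similarities mentioned above can be pictured with a minimal sketch; the simple averaging of the per-space similarity matrices and the temperature value are assumptions for illustration, not taken from the paper's released code.

```python
import numpy as np

def softmax(x, axis, temperature=0.01):
    """Numerically stable softmax along the given axis."""
    z = x / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def revised_similarities(sims_per_space, temperature=0.01):
    """Fuse per-space query-video similarity matrices (here: simple averaging)
    and revise them with softmax normalisation over queries and over videos."""
    S = np.mean(sims_per_space, axis=0)                        # (num_queries, num_videos)
    S_over_queries = softmax(S, axis=0, temperature=temperature)
    S_over_videos = softmax(S, axis=1, temperature=temperature)
    return S * S_over_queries * S_over_videos                  # element-wise revision

# toy usage: two joint feature spaces, 3 queries, 5 videos
rng = np.random.default_rng(0)
spaces = [rng.random((3, 5)) for _ in range(2)]
ranking = np.argsort(-revised_similarities(spaces), axis=1)    # per-query video ranking
```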

    Learning to detect video events from zero or very few video examples

    In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events from solely a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices for various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness of the proposed methods. Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication.
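    A rough way to picture the weighted use of "related" videos is a standard SVM trained with per-sample weights; this is a simplified stand-in for the SVM extension the paper describes, and the weight values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative features: a few true positives, some "related" videos,
# and a larger set of background negatives.
rng = np.random.default_rng(1)
X_pos, X_rel, X_neg = rng.random((5, 64)), rng.random((20, 64)), rng.random((200, 64))

X = np.vstack([X_pos, X_rel, X_neg])
y = np.concatenate([np.ones(len(X_pos) + len(X_rel)), np.zeros(len(X_neg))])

# Assumed weighting: full weight for true positives, a reduced weight for
# "related" videos so they influence the detector less than real positives.
sample_weight = np.concatenate([
    np.full(len(X_pos), 1.0),
    np.full(len(X_rel), 0.3),
    np.full(len(X_neg), 1.0),
])

detector = SVC(kernel="linear")
detector.fit(X, y, sample_weight=sample_weight)
scores = detector.decision_function(X_neg)  # rank unseen videos by detector score
```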

    Temporal Lecture Video Fragmentation using Word Embeddings

    In this work the problem of temporal fragmentation of lecture videos into meaningful parts is addressed. The visual content of a lecture video cannot be effectively used for this task because it is extremely homogeneous. We propose a new method for lecture video fragmentation in which only the automatically generated speech transcripts of a video are exploited. Contrary to previously proposed works that employ visual, audio and textual features and use time-consuming supervised methods requiring annotated training data, we present a method that analyses the transcripts' text with the help of word embeddings generated from pre-trained state-of-the-art neural networks. Furthermore, we address a major problem of video lecture fragmentation research, which is the lack of large-scale datasets for evaluation, by presenting a new artificially-generated dataset of synthetic video lecture transcripts that we make publicly available. Experimental comparisons document the merit of the proposed approach.
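    One transcript-only fragmentation strategy consistent with the description above is to embed consecutive transcript windows with pre-trained word vectors and cut where their similarity drops; the windowing, mean-pooling and threshold below are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def window_embedding(words, word_vectors, dim=2):
    """Mean of the available pre-trained word vectors in one transcript window."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def fragment_boundaries(windows, word_vectors, threshold=0.5):
    """Indices where a new lecture fragment is assumed to start: positions where
    the cosine similarity between consecutive transcript windows is low."""
    embs = [window_embedding(w, word_vectors) for w in windows]
    boundaries = []
    for i in range(1, len(embs)):
        a, b = embs[i - 1], embs[i]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cosine = (a @ b) / denom if denom else 0.0
        if cosine < threshold:
            boundaries.append(i)
    return boundaries

# toy usage with a tiny hand-made "embedding" table (real use: pre-trained vectors)
word_vectors = {"gradient": np.array([1.0, 0.0]), "descent": np.array([0.9, 0.1]),
                "shakespeare": np.array([0.0, 1.0]), "sonnet": np.array([0.1, 0.9])}
windows = [["gradient", "descent"], ["gradient"], ["shakespeare", "sonnet"]]
print(fragment_boundaries(windows, word_vectors))  # -> [2]
```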

    Concept Language Models and Event-based Concept Number Selection for Zero-example Event Detection

    Zero-example event detection is a problem where, given an event query as input but no example videos for training a detector, the system retrieves the most closely related videos. In this paper we present a fully automatic zero-example event detection method that is based on translating the event description to a predefined set of concepts for which previously trained visual concept detectors are available. We adopt the use of Concept Language Models (CLMs), a method of augmenting the semantic definition of each concept, and we propose a new concept-selection method for deciding on the appropriate number of concepts needed to describe an event query. The proposed system achieves state-of-the-art performance in automatic zero-example event detection.
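    The concept-selection step can be sketched as follows; the scoring of concepts against the query and the keep-while-close-to-top rule are simplified assumptions, not the paper's exact CLM-based procedure.

```python
import numpy as np

def select_concepts(query_concept_scores, min_concepts=3, keep_ratio=0.8):
    """Pick a query-dependent number of concepts: keep a few best ones, then keep
    further concepts while their score stays close to the top score (illustrative rule)."""
    order = sorted(query_concept_scores, key=query_concept_scores.get, reverse=True)
    top = query_concept_scores[order[0]]
    selected = order[:min_concepts]
    for concept in order[min_concepts:]:
        if query_concept_scores[concept] >= keep_ratio * top:
            selected.append(concept)
    return selected

def rank_videos(selected, detector_scores_per_video):
    """Score each video by averaging its concept-detector scores over the selected concepts."""
    return {vid: float(np.mean([scores[c] for c in selected]))
            for vid, scores in detector_scores_per_video.items()}

# toy usage: query-to-concept relatedness and per-video concept detector outputs
query_concept_scores = {"dog": 0.9, "park": 0.85, "frisbee": 0.8, "kitchen": 0.2}
videos = {"v1": {"dog": 0.7, "park": 0.6, "frisbee": 0.5, "kitchen": 0.1},
          "v2": {"dog": 0.1, "park": 0.2, "frisbee": 0.1, "kitchen": 0.9}}
selected = select_concepts(query_concept_scores)
print(sorted(rank_videos(selected, videos).items(), key=lambda kv: -kv[1]))
```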

    VERGE in VBS 2018

    Paper presented at: MultiMedia Modeling, 24th International Conference, MMM 2018, held in Bangkok, Thailand, 5-7 February 2018. This paper presents the VERGE interactive video retrieval engine, which is capable of browsing and searching within video content. The system integrates several content-based analysis and retrieval modules including concept detection, clustering, visual and textual similarity search, query analysis and reranking, as well as multimodal fusion. This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreements H2020-645012 KRISTINA, H2020-779962 V4Design, H2020-732665 EMMA, and H2020-687786 InVID.

    VERGE in VBS 2019

    This paper presents VERGE, an interactive video retrieval engine that enables browsing and searching within video content. The system implements various retrieval modalities, such as visual or textual search, concept detection and clustering, as well as multimodal fusion and reranking capabilities. All results are displayed in a graphical user interface in an efficient and user-friendly manner.
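    The multimodal fusion and reranking step can be pictured as a weighted late fusion of the per-module scores; the min-max normalisation and the weights below are assumptions for illustration, not VERGE's actual implementation.

```python
def late_fusion(module_scores, weights=None):
    """Weighted late fusion of per-module retrieval scores (e.g. visual similarity,
    textual similarity, concept detection), min-max normalised per module."""
    fused = {}
    weights = weights or {m: 1.0 / len(module_scores) for m in module_scores}
    for module, scores in module_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        for shot, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            fused[shot] = fused.get(shot, 0.0) + weights[module] * norm
    return sorted(fused, key=fused.get, reverse=True)  # reranked shot list

# toy usage: two modules scoring the same three video shots
visual = {"shot1": 0.9, "shot2": 0.4, "shot3": 0.1}
textual = {"shot1": 0.2, "shot2": 0.8, "shot3": 0.3}
print(late_fusion({"visual": visual, "textual": textual}, {"visual": 0.6, "textual": 0.4}))
```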

    VERGE in VBS 2017

    Paper presented at the Video Browser Showdown (VBS'17), at the 23rd International Conference on MultiMedia Modeling (MMM'17), held on 4 January 2017 in Reykjavik, Iceland. This paper presents the VERGE interactive video retrieval engine, which is capable of browsing and searching within video content. The system integrates several content-based analysis and retrieval modules including concept detection, clustering, visual similarity search, object-based search, query analysis, and multimodal and temporal fusion. This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreements H2020-687786 InVID, H2020-693092 MOVING, H2020-645012 KRISTINA and H2020-700024 TENSOR.

    ITI-CERTH participation in TRECVID 2016

    This paper provides an overview of the runs submitted to TRECVID 2016 by ITI-CERTH. ITI-CERTH participated in the Ad-hoc Video Search (AVS), Multimedia Event Detection (MED), Instance Search (INS) and Surveillance Event Detection (SED) tasks. Our AVS task participation is based on a method that combines the linguistic analysis of the query and the concept-based annotation of video fragments. In the MED task, for the 000Ex condition we exploit the textual description of an event class in order to retrieve related videos, without using positive samples; for the 010Ex and 1000Ex conditions, a kernel subclass version of our discriminant analysis method (KSDA) combined with a fast linear SVM is employed. The INS task is performed by employing VERGE, an interactive retrieval application that integrates retrieval functionalities considering only visual information. For the SED task, we deployed a novel activity detection algorithm based on Motion Boundary Activity Areas (MBAA), dense trajectories, Fisher vectors and an overlapping sliding window.

    ITI-CERTH in TRECVID 2016 Ad-hoc Video Search (AVS)

    This presentation provides an overview of the runs submitted to TRECVID 2016 by ITI-CERTH in the Ad-hoc Video Search (AVS) task. Our AVS task participation is based on a method that combines the linguistic analysis of the query and the concept-based annotation of video fragments.